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AND Larry Wasserman 

Carnegie Mellon University 

We introduce a new version of forward stepwise regression. Our 
modification finds solutions to regression problems where the selected 
predictors appear in a structured pattern, with respect to a predefined 
distance measure over the candidate predictors. Our method is moti- 
vated by the problem of predicting HIV-1 drug resistance from pro- 
tein sequences. We find that our method improves the interpretability 
of drug resistance while producing comparable predictive accuracy to 
standard methods. We also demonstrate our method in a simulation 
study and present some theoretical results and connections. 

1. Introduction. About twenty antiretroviral drugs are currently avail- 
able for the treatment of human immunodeficiency virus type 1 (HIV-1). 
The great majority of these function by inhibiting the activity of various 
proteins produced by the HIV-1 virus, effectively impairing the virus' ability 
to reproduce. Resistance to these drugs develops when a mutation changes 
the structure of the target protein enough to frustrate the drug while still 
maintaining the function of the protein. HIV-1 is capable of rapid mutation, 
and is thus often able to adapt to antiretroviral therapy. Understanding the 
genetic basis for this developed resistance would allow more effective devel- 
opment of new drugs, as well as more informed prescription of the currently 
available drugs. 

Sequencing HIV-1 proteins can be done reliably, and well-designed in- vitro 
experiments are available for testing the resistance of a particular strain of 
HIV-1 to drugs; see Petropoulos et al. (2000) and Zhang et al. (2005). We 
approach this problem using regression. This problem setting leads us to 
build models to predict drug resistance using mutations in the amino acid 
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sequence of the target proteins. We desire models that are easy to interpret 
and take into account properties of proteins and amino acids. In particular, 
it is well known that proteins generally function using areas called active 
sites, that are, simply, areas of the sequence where the protein binds or 
otherwise interacts with other molecules. This fact leads us to believe that 
important mutations will tend to be clustered around such sites. 

Protein sequences can be thought to have two layers of structure: the 
primary sequence consisting of a single string of adjacent amino acids, and 
a secondary structure created by protein folding. We can measure the dis- 
tance between amino acids in a protein sequence roughly using the dif- 
ferences in position in the primary sequence. When the protein's folding 
structure is known, three-dimensional distance can be calculated for any 
two amino acid positions. But even when the structure of the protein is 
unknown, because of the continuity of the primary sequence, clustering in 
three-dimensional space generally corresponds to clustering in the protein 
primary sequence. 

We therefore build models for predicting resistance from mutations that 
have the following two properties: (1) Sparsity — a model that uses only 
a few mutations is easier to interpret and apply. (2) Structure — following 
the concept of active sites, we wish to use mutations that are clustered in 
the protein primary sequence. Note that this second property is desirable 
in other applications. For example, Liu, Lin and Ghosh (2007) use genetic 
pathways to model the genetic influences on prostate cancer. These pathways 
can be modeled as a structure on individual genes. In this paper we introduce 
a variable selection method that builds regression models that satisfy these 
two properties. 

Forward stepwise regression and the lasso are two popular automatic 
variable selection techniques that are effective at finding sparse regression 
models. Given data {Xi,Yi), . . . ,{Xn,Yn) where 1^ e M and Xi G W, the 
lasso /3iasso estimator due to Tibshirani (1996) minimizes 

n 

(1) 5^(y.-Xf/3)2 + A||/3||i, 

i=l 

where ||/3||i = \f3j\ and A > is a tuning parameter which controls the 
amount of regularization. Forward stepwise regression is a greedy method 
that adds one predictor, that is, one element Xi, at a time. Both produce 
sparse solutions, meaning that f3j = for most j. Sparse solutions are at- 
tractive both computationally and for interpretation. 

Recent results show that both methods yield estimators with good prop- 
erties. See Bunea, Tsybakov and Wegkamp (2007), Greenshtein and Ritov 
(2004), Wainwright (2007) for results on the lasso, and Barron et al. (2008) 
for results on forward stepwise regression. These papers show that, under 
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weak conditions, both approaches yield predictors that are 0(n~^/^) close to 
the optimal sparse linear predictor. Moreover, this rate cannot be improved. 
In our application, extra information is available — we expect nonzero /3j's to 
cluster together. In this case, we would like to add an additional constraint 
to the regression. 

In this paper we introduce a modification of forward stepwise regression 
that encourages the selection of new predictors that are "close" — with re- 
spect to a distance measure over the predictors — to those already included 
in the model. We show that our method. Clustered and Sparse Regression 
(CaSpaR), is useful in regression problems where we desire both a sparse 
and structured solution. 

2. Data. The Stanford HIV drug resistance database described in Rhee 
et al. (2003) is a large data set of HIV-1 protease sequences, along with 
resistance phenotypes for up to seven different protease inhibitor (PI) drugs 
for each sequence. This database is a combination of smaller data sets col- 
lected in different clinical trials. Since both the genotyping and phenotyping 
experiments are well standardized, such a joining of data will not give rise 
to significant heterogeneity-in-sample concerns. Each protease protein se- 
quence is 99 amino acids long. The phenotypes are obtained from in-vitro 
experiments, and are measured in terms of number of multiples of standard 
dose of drug needed to suppress virus reproduction. 

We can cast the problem of connecting genotype to phenotype as a re- 
gression problem by treating each mutation as a predictor. Previous studies 
by Rhee et al. (2006) and Beerenwinkel et al. (2003) have used most modern 
sparse regression and classification techniques to attack this problem. We 
seek a model that will take into account protein active sites. 

3. CaSpaR. We first introduce the usual regression setting. We have an 
n X p data matrix X and n x 1 response vector Y. We use the usual linear 
model 

(2) Y = X/3 + e. 
Define the support of /3 by 

(3) supp(/3) = {j:/3j^O,j = l,...,p}. 

We assume that /? is sparse (most /3j's are 0) and also that supp(/3) has 
structure. We base this structure on a distance measure d{-,-) over the set 
of predictors: 



(4) 



d{;-):{l,...,p}x{l,...,p}^R. 
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Table 1 
Forward stepwise regression 



1. Input: ^ = 0, X, Y, e>0. 

2. Fit an OLS model: /3 = argmin^j HX/J - y\\l, s.t. supp(/?) C A. 

3. Set i* =argmax{i^A} |(X/3-y)^Xi|. 

4. If \xj* (X/3 — y)| < £ then stop, else set A — Avji* and go to step 2. 



Specifically, we assume that the nonzero elements of /3 are spatially clustered 
with respect to d{-, •). In other words, the nonzero entries of j3 appear in some 
number of groups in which the members are "close" to each other — as defined 
by d{-,-). Our goal is to accurately recover /3, with particular emphasis on 
this sparsity structure. 

We want to modify a sparse regression technique to produce solutions 
with clusters of nonzero coefficients. Penalized techniques such as the lasso 
are difficult to modify for this purpose. Recall that the lasso finds (3 that 
minimizes 

n 

(5) Q(/3) = 5](l^,-Xf/3)2 + A5^|/3,|. 

i=l j 

The lasso is computationally efficient because Q{/3) is convex. It is difficult to 
add a penalty to Q{P) that encourages clustered solutions while maintaining 
convexity. Note that the fused lasso due to Tibshirani et al. (2005) adds 
a penalty of the form |/3j — This forces nearby coefficients to be 

close together in sign and magnitude. We want the support points to be 
close together, but we do not want to force the values of the coefficients 
to be close together. Instead, we are only concerned with the inclusion or 
exclusion of predictors. 

Stepwise procedures are more flexible and easier to modify, because we do 
not need to worry about maintaining the convexity of an objective function. 
We therefore propose a modification to forward stepwise regression (see Ta- 
ble 1 for a description of forward stepwise regression) . We call our algorithm 
CaSpaR (Clustered and Sparse Regression); see Table 2. 

Table 2 

CaSpaR: Clustered and Sparse Regression 

1. Input: A = 0, X, Y, ft > 0, a e (0, 1), e > 0. 

2. Fit an OLS model: /3 = argmiug ||X/3 - y||2, s.t. supp(/?) C A. 

3. VZ ^ A, calculate: Wi — ^ 'l2{keA} Kh {d{l, k)). If this is the first iteration of the 
algorithm, set Wi = 1, VI. 

4. Set r = a.rgm^a^^l^A}Wl\{X|3 - y)'^xi\. 

5. If \xf, (X/3 — y)| < e then stop, else set A — AuT and go to step 2. 
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In each iteration of forward stepwise regression, the following quantities 
are used to select the next predictor to be added to the model: 

(6) C7, = |(X/3-Yfx,|, 

where Xj denotes the jth column of X. Note that the Cj are proportional to 
the correlations between each candidate predictor and the current residuals 
if the columns of X are scaled to empirical mean zero, variance one. We 
wish to encourage the selection of predictors that are close, with respect 
to d{-, •), to those already in the model. To do this, we multiply the Cj by 
a kernel, which we construct based on the current active set A. This kernel 
will weight the Cj so that predictors that are close to those already in the 
model receive more weight than those that are not. 

Formally, suppose we have a kernel Kh that is centered at 0, where h de- 
notes the bandwidth parameter. Then, for all I ^ A, we compute 

(7) ^i = AiYl K,,{d{l,k)). 

' ' {leA} 

We then select the next predictor j* using a weighted criterion: Wj(X./3 — 
y)^Xj. For most familiar kernels, such as a Gaussian kernel or an Epanech- 
nikov kernel, this has the effect of boosting the criterion for predictors "near" 
those already included in the model, and diminishing the criterion for those 
"far away." For practical application, we recommend a mixture of a familiar 
kernel, such as a boxcar or Epanechnikov, and a uniform distribution. This 
mixture, which we call the Stetson kernel, introduces an additional mixing 
parameter a: 

(8) Kh,a{x) = a + {l-a)Kh{d{x)), 

where Kh is a kernel such as a boxcar, Epanechnikov or Gaussian. An ex- 
ample of this kernel appears in Figure 1. We particularly recommend the 
Epanechnikov or the boxcar kernel, because these kernels have no impact at 
all on predictors outside their bandwidth, and so Wj = a for these predictors. 
While this usually makes no difference in predictor selection, it simplifies 
precise computation and interpretation. 




Fig. 1. The Stetson kernel, with an Epanechnikov kernel. 
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The advantage of the Stetson kernel is that this mixture allows multiple 
groups of predictors to appear in the sparsity structure. If we were instead to 
only use a familiar kernel, then we would have Wj = (or extremely small) 
for those j far enough away from predictors already included in the model. 
This approach would lead to only a single group in the sparsity structure, 
built around the first selected predictor, whereas most applications call for 
multiple groups. The Stetson kernel avoids this problem. The uniform part 
of the Stetson kernel allows new predictor groups to enter the model. The 
kernel part of the mixture encourages clustering around predictors already 
included in the model. 

Finally, note that CaSpaR is closely related to forward stepwise regres- 
sion. Indeed, with a = 1 CaSpaR reduces to forward stepwise regression. 
Therefore, as long we consider a = 1 when picking parameters, we always 
consider the forward stepwise regression solution. Consequently, we have 
a loose guarantee that CaSpaR does no worse than forward stepwise regres- 
sion. Moreover, we expect that some theoretical results relating to forward 
stepwise regression can be adapted to CaSpaR. 

3.1. Tuning parameters. CaSpaR has three tuning parameters: e, h and a. 
The parameter e controls the sparsity of the fitted model. The parameters h 
and a control the amount of structure in the estimated support. For the Stet- 
son kernel, as the bandwidth h decreases, the predictors become more tightly 
grouped. As a increases, new clusters are allowed to form more easily. In 
the special case where a = 1, the method reduces to the usual forward step- 
wise regression method. Let CV{£,h,a) denote the cross-validation score. 
We choose the parameters by minimizing CV{£, h, a). Note that since small 
changes in /i or a do not affect the order of predictor selection, this tuning 
can be accomplished using a simple grid search. 

4. Results. We now return to our application to HIV drug resistance. 
Our data set consists of 553 amino acid sequences, all 99 amino acids in 
length. Each amino acid sequence corresponds to a different strain of HIV 
found within a patient. Each sequence has resistance measurements for up 
to seven HIV inhibiting drugs. Thus, the number of sequences available for 
our analysis varies depending on which drug we consider. 

After we choose a drug and take the appropriate subset of our 553 se- 
quences, we create our predictors. With twenty known amino acids, each 
position in these sequences thus takes twenty possible values. We thus de- 
fine our mutation predictors as follows. At each of the 99 positions, we first 
search across all of the available sequences and record the set of amino acids 
that appear at that position in the data. This set is the collection of possible 
mutations at that particular position. If there is only one amino acid in this 
set, this corresponds to the case where that particular position displays no 
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variation in amino acid over the data, and thus can be dropped from the 
analysis. We use mutations from positions with M > 1 possible amino acids 
to create M — 1 predictors. Each of these predictors is an indicator vari- 
able which, for a particular sequence, is equal to 1 if the particular amino 
acid appears at that particular position and otherwise. We refer to these 
predictors as mutations. Since each mutation has an associated position in 
the primary sequence, we can define a distance between predictors as the 
absolute difference of their positions. Thus, the mutations that occur at the 
same position are distance from each other. 

Our design matrix X is thus an ridrug x Pdmg matrix. Here njrug is the 
number of sequences with measurements of the resistance score for the drug 
of interest. The number of sequences with resistance measurements for each 
drug are as follows: 453 for drug APV, 212 for ATV, 496 for IDV, 300 for 
LPV, 510 for NFV, 465 for RTV, and 493 for SQV. We then create the 
Pdmg mutation indicator predictors as described above. Since the number of 
samples varies with the drug, so does the number of mutation predictors. The 
number of predictors for each drug are as follows: 210 for drug APV, 180 for 
ATV, 215 for IDV, 199 for LPV, 219 for NFV, 215 for RTV, and 218 for SQV. 

We compare CaSpaR to forward stepwise regression and lasso models. 
For all methods, we use ten- fold cross-validation to choose all the tuning 
parameters. We use the R package gimnet [Friedman, Hastie and Tibshirani 
(2010)] to obtain the lasso solution. For CaSpaR, we use the Stetson kernel 
and perform a grid search over a = {0, 0.1, 0.2, . . . , 1}, and over h = {1, 2, 3, 4} 
to find the optimal tuning parameters. 

We present a summary of our results in Table 3. Compared to stepwise 
regression, CaSpaR has comparable mean-squared-error (MSE) and number 

Table 3 

Summary of results across all models and drugs. For each model, we give the 
mean-square-error, as well as the number of mutations (predictors) selected in 
parentheses. We see that CaSpaR is comparable to forward stepwise regression in terms 
of MSE, with about the same number of predictors included in the model. The lasso does 

better in MSE, but includes many more mutations than either stepwise method. As we 
previously noted, neither forward stepwise regression nor the lasso allows for a structured 

sparse solution 



Drug name 


Stepwise 


CaSpaR 


Lasso 


APV 


0.514 (7) 


0.477 (14) 


0.422 (51) 


ATV 


0.588 (6) 


0.494 (11) 


0.477 (39) 


IDV 


0.541 (13) 


0.580 (10) 


0.449 (77) 


LPV 


0.614 (5) 


0.507 (15) 


0.518 (35) 


NFV 


0.650 (19) 


0.637 (22) 


0.661 (40) 


RTV 


0.659 (8) 


0.714 (5) 


0.570 (58) 


SQV 


0.426 (31) 


0.508 (21) 


0.447 (63) 



8 



PERCIVAL, ROEDER, ROSENFELD AND WASSERMAN 



Stepwise & CaSpaR Model Comparisons 
APV (Stepwise) ATV (Stepwise) RTV (Stepwise) SQV (Stepwise) 



3.5 
1.8 
0.0 
-1.8 
-3.5 

3.5 
1.8 
0.0 
-1.8 
-3.5 



1 1 


1 


1 
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50 
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Fig. 2. Comparison of stepwise and CaSpaR models across four drugs: APV, ATV, 
RTV and SQV. Each plot gives the coefficients for the selected mutation predictors, versus 
the locations of these mutations in the protein sequence. Each vertical line represents the 
magnitude of the coefficient for a mutation predictor. Note that some sequence locations 
can have multiple mutations. 



of mutations selected. In most cases, CaSpaR selects a few more mutations 
and has a slightly lower MSE. The lasso generally does better in terms of 
MSE, but includes many more mutations. These results are complicated 
and cumbersome to interpret as a model of resistance. Overall, CaSpaR 
gives relatively sparse models, as desired. 

Figure 2 compares the sparsity structure in the CaSpaR and stepwise 
solutions in four of the drugs. If we compare the sparsity patterns of the 
stepwise and CaSpaR solutions, we see that CaSpaR gives more clustered 
solutions, as expected. As mentioned before, CaSpaR and stepwise regression 
select about the same number of mutations. The clustered CaSpaR solutions, 
however, select mutations from fewer positions than stepwise regression. The 
CaSpaR models therefore give a comparable level of prediction accuracy and 
sparsity, while also having a better biological interpretation: these clusters 
may correspond to a functional area of the protein. 

5. Simulation study. We next report the results of a simulation study. 
We show that CaSpaR recovers a structured sparsity pattern more effectively 
than forward stepwise regression and lasso. For CaSpaR, we use a Stetson 
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kernel, and tune the parameters with a grid of /i = {1, 2, 3, 4}, and a = 
{0.1, 0.2, . . . , 1}. For each method, we use 10-fold cross-validation to choose 
all tuning parameters and stopping times. To measure the performance of 
each method, we use 

llfl fll|2 

(9) Recovery Error = - 



where /3 is the coefficient estimated by the method and /3 is the true coeffi- 
cient vector. This metric appeals to us since it captures both selection and 
estimation performance. We also compare the true positive rate and false 
positive rate in order to directly measure selection performance. Here, a true 
positive is when a nonzero entry of /3 is also nonzero in f3. A false positive 
is when a nonzero entry of f3 is zero in f3. 

We simulate 100 nxp data matrices X with p = 250 columns. Each entry 
of these X is an i.i.d. draw from a standard normal distribution. We generate 
100 corresponding true coefficient vectors /3 so that each has 7 groups of 5 
nonzero coefficients, randomly placed. Thus, there are 35 nonzero entries in 
each (3. Within each nonzero group, we set one entry of /? equal to 6, and 
the rest equal to 3 (see the top panel of Figure 3 for a display of a sample 




Fig. 3. Recovery of coefficients for a single simulated data set (n — 100). The top panel 
displays the target coefficient vector. The next three panels show the estimated coefficients 
for Stepwise, CaSpaR and LASSO, having recovery errors 0.848, 0.059, 0.542, respectively. 
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Simulation: Recovery Comparison 
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Fig. 4. Recovery error — /SHi/ll/SHi) "n simulated data with 1- dimensional structured 
sparsity. Black points: stepwise regression; green points: lasso; red points: CaSpaR. We 
can see that with less data CaSpaR achieves a much better recovery rate than either of the 
other two methods. 



coefficient vector). We then randomize the signs of the nonzero entries. We 
add independent Gaussian noise with variance 1 to the simulated response. 

To compare the three methods, we increase n from 50 to 150 (n = 50, 75, 
100, 125, 150) and compare the average recovery errors of the three methods; 
cf. Figure 4. CaSpaR gives near-optimal performance with fewer data points 
than the other methods. An example of the differences in performance be- 
tween the three methods on a single simulated data set (n = 100) is given 
in Figure 3. CaSpaR recovers the signal well, while the other two methods 
do not. Figures 5 and 6 display a comparison of the true positive rates and 
false positive rates of the three methods. We see that CaSpaR achieves the 
best balance of these two properties, with near optimal performance when 
n = 150 — a property not seen with stepwise regression or the lasso. We there- 
fore conclude that CaSpaR can reconstruct sparse signals more effectively 
than stepwise regression or the lasso. 

6. Theoretical properties. In this section we discuss the theoretical prop- 
erties of CaSpaR. We begin by explaining how CaSpaR relates to other 
methods. 



6.1. Related work. Several existing regression methods take into account 
structure as well as sparsity. Yuan and Lin (2006) introduced the grouped 
lasso, which allows only groups of predictors to be selected at once. This is 
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Fig. 5. True positive rate (number of correctly identified nonzero entries of P in j3 /total 
number of nonzero entries of P) on simulated data with 1-dimensional structured sparsity. 
Black points: stepwise regression; green points: lasso; red points: CaSpaR. CaSpaR is com- 
petitive with the other two methods. Note that the superior true positive rate of the lasso 
comes at the cost of a high rate of false positives. 



Simulation: Selection Comparison 




50 75 100 125 150 

Sample Size 



Fig. 6. False negative rate (number of incorrect nonzero entries of /3 with respect to 
P /total number of nonzero entries of P) on simulated data with 1-dimensional structured 
sparsity. black points: stepwise regression; green points: lasso; red points: CaSpaR. CaSpaR 
achieves a superior rate to the lasso. Note that the y-axis range is [0,0.2]. 
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desirable when the groups represent closely linked predictors — such as a set 
of predictors that code the levels of a multilevel factor predictor. Since this 
method modifies the lasso, it can be cast as a convex minimization problem. 
However, the groups have to be predefined, and the method does not allow 
for overlap between groups, making this method somewhat inflexible. 

Huang, Zhang and Metaxas (2009) introduced an algorithm called Struc- 
tOMP that modifies forward stepwise regression (also known as orthogonal 
matching pursuit or OMP). Here, the desired sparsity structure is encoded 
as a set of blocks, each of which is assigned a cost. The algorithm proceeds 
by greedily adding blocks one at a time to reduce the loss, scaled by the cost 
of the added block. StructOMP allows for very flexible sparsity structures. 
In particular, it can approximate a general class of sparsity structures the 
authors term graph sparsity, which we discuss in Section 6.2. 

Recent work by Jacob, Obozinski and Vert (2009) relating to the grouped 
lasso extends the possible group structures to include overlapping groups. 
Like StructOMP, the overlapping group penalty can produce models that 
approximately follow graph sparsity. This approach has the advantage of 
being a convex minimization problem. As we discuss in the next section, for 
graph sparsity, this method, like StructOMP, gives only an approximation 
to graph sparsity because of computational considerations. 

6.2. Graph sparsity. Graph sparsity is a specific type of structured spar- 
sity introduced by Huang, Zhang and Metaxas (2009). Consider a graph G 
whose nodes include the set X = {1, 2, ... Thus, each predictor is a node 
of G, but for generality we allow other nodes to be in the graph as well. We 
then define the neighborhood of a node v as the set of nodes with an edge 
connecting it to v. More generally, we could allow for /^-neighborhoods — the 
set of all nodes with a path of at most k edges connecting it to v. We then 
consider a sparsity structure where the important predictors appear within 
neighborhoods, or a series of connected neighborhoods. 

For example, consider a grid graph, such as in the case of a pixelated im- 
age. Each pixel is connected to four neighbors, one to each cardinal direction. 
The sparsity structure for this graph connects visually related components 
in the image. 

CaSpaR can approximate graph sparsity if we employ an appropriate 
distance function and bandwidth. Given a graph G, the distance function 
can be defined in terms of the graph: 

(10) d{l,m) = min{Length of paths from I to m, as defined by G}. 

More generally, each edge can be weighted, and d{-,-) can be the min- 
imal weighted path length. We then can define neighborhood size via the 
bandwidth h. For the Stetson kernel, the mixing parameter a controls the 
number of connected neighborhoods, where a = allows only one. In the 
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image example, we can define d{-, ■) as above. Then, with h € (1,2), CaSpaR 
would find a sparsity structure of connected pixels. 

CaSpaR is a very flexible way to approximate graph sparsity. First, it 
allows for neighborhoods to be locally defined through the bandwidth while 
still allowing neighborhoods to grow arbitrarily large as the method pro- 
ceeds. Second, when used with the Stetson kernel, CaSpaR allows the user 
to control the degree to which graph sparsity is adhered via the mixing 
parameter a. 

In comparison, the algorithms for the StructOMP of Huang, Zhang and 
Metaxas (2009) and graph lasso of Jacob, Obozinski and Vert (2009) approx- 
imate graph sparsity by constructing a set of node neighborhoods, based on 
the graph structure. These generate a set of blocks or groups, that are then 
used in the OMP or group lasso framework, respectively. However, to con- 
trol the computational cost, they limit the neighborhood size used to make 
these blocks or groups. Because CaSpaR grows neighborhoods instead of 
seeking to add them all at once as a group or block, this is not necessary. 
These algorithms can handle large groups or blocks, but only at significant 
computational cost. 

Further, in StructOMP, there is no clear way to control the degree to 
which graph sparsity is followed in the solution. The blocks are each assigned 
a cost, but this cost is relatively restrictive. In graph lasso, the group penalty 
is controlled by a parameter A, just as with the ii lasso penalty. However, the 
group penalty controls sparsity as well as the structure, so as A decreases, the 
model becomes less sparse as well as less structured. A separate £i penalty 
could allow the model to be controlled for sparsity and structure separately. 

6.3. Consistency. We now explain how a result in Zhang (2009) on step- 
wise regression can be adapted to CaSpaR. We summarize the result from 
the literature as follows: under assumptions about the data matrix and the 
response, it can be shown that with high probability, when the forward step- 
wise procedure stops, it stops with all correctly selected predictors — that is, 
all the nonzero entries of the final /3 are also nonzero in the true target /3. 
Note that there may be additional "false negatives." Moreover, if all of the 
target coefficients are above a threshold set by the noise level, then the entire 
sparsity pattern is captured exactly. 

We closely follow the proof in Zhang (2009). This result requires more 
conditions than the similar result for stepwise regression. However, since we 
assume that we have a certain set of tuning parameters {a,h}, the assump- 
tions are not too harsh. For ease of reference, we use notation similar to 
Zhang (2009). 

We have an n xp matrix X consisting of p n-vectors {xi,X2, . . . ,Xp}, and 
an ?i-vector y. We assume that there is a target f3 E M^, such that 




Ey = Xp. 
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This assumption means that the Unear model is correct. It also roughly 
means there is a target coefficient vector /3 that estimates y well, relative to 
the noise level. For both stepwise and CaSpaR methods, we define (3^^^ as 
the coefficient vector after the fcth step. Recall the definition of the support 
of a vector: 

(12) supp(/3) = {i:/3,/0}. 
We then define F^^'^ = supp(/3('=)), F = supp(;5). Let 

(13) /5x(-P', y) = arg min ||X/3 — y||2 subject to supp(/3) C F. 



Finally, we define two technical quantities: 

(14) fix(F) = max || (X^X-p)-^ X^^, ||i 

and 

(15) px(F)=Ml-\\Xf3g/m\l.snpp{l3)cF 



(3 In 

For CaSpaR, we define a distance measure on our predictor index 1,2,. . . ,p: 
d{-, •). We assume that we are using a boxcar kernel, or a Stetson kernel with 
a boxcar kernel: Kh,m{l) = Id{rnd{k,i)<h- We then define the following set, 
which represents the candidate predictors — predictors not already included 
in the model — "underneath" the kernel: 

(16) A^'^) = {m:d(Z,m) </i,m^F(^')}. 

It follows that 



(Ij) ra + (l-a)//c:jGA(^), 



Finally, recall that we have e as the stopping criterion for CaSpaR. If at 
step k we select Xj^^) as the next predictor to be included in the model, then 
if 

(18) |x5,)(X/3('=-i)-y)|<e, 
CaSpaR stops at step k — I. 



Theorem 1. Suppose that: 

1. i|x,-||2 = l VjGl,2,...,p. 

2. 3/3 gRP, with F = supj){/3) s.t. y = Xp. 

3. iix{F)<l. 

4. px{F)>^. 
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5. The elements of y; [yi]i=i,2,...,n independent sub-Gaussian random 
variables: 3a > s.t. \fi,\ft G R,Ee^^y^-^y^^ < e'^'*'/^^ 

6. Given i] G (0, 1), let the stopping criterion satisfy 

e > 3— =-aV21og(2p/r/). 

l-lix[F) 

7. There are {a, /i} such that for each k, at least one of the following con- 
ditions holds: 

max,.^^|xJ(X/3(fe-i)-y)| 
max^g^lxfpf/SC^-D-y)! < 

(b) A^^-i) CF, 

(c) A(*^-i) D F. 

Then, when the procedure stops at step k — 1, with probability greater than 
1 — 2rj, the following hold: 

1. F^~^^ CF, _ _ _ 

2. |F-F(fc-i)|<2|{ieF:|/3^.| <3epx(F)-V^}|, 

3. 11/3(^-1) -/3x(F,y)||2< £px(F)-y|F-F(fe^i )|M 

4. \\^^(F,y)-Mo.<a^2log{2\F/ri)/inpx(F)). 

We omit the proof as it is very similar to the proof in Zhang (2009). 

6.3.1. Discussion of the result. The theorem states that when the proce- 
dure stops: (1) the selected predictors have truly nonzero /3j; (2) the number 
of false negatives is bounded by the number of small truly nonzero /3j — 
relative to the noise level; (3) the estimator is close to the best possible /3, 
which is estimated in the presence of noise using all the truly nonzero pre- 
dictors; and (4) the difference between the best estimate in the presence of 
noise and that of the true /3 is bounded. 

The proof of this result is based on induction at each step of the procedure. 
The extra conditions are motivated by the following analysis. We denote any 
predictor for which /3j = as a noise predictor and any predictor for which 

7^ as a signal predictor. When we consider adding a predictor in a step 
of forward stepwise regression, we consider two quantities: 

(19) max|xJ(X/3('^-^)-y)|, 

(20) mcix|xj (X/?^^-!) -y)|. 

i€F 

These are, respectively, proportional to the maximum correlation between 
the current residuals and a noise predictor and the maximum correlation 
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between the current residuals and a signal predictor. We refer to these two 
predictors as the "best" signal predictor and the "best" noise predictor. 

For CaSpaR, we must consider how the weights applied to these quantities 
affect the analysis. We therefore consider the cases where: (a) the best signal 



scenario (d), the original result for stepwise regression holds. We therefore 
make additional assumptions to ensure that case (d) does not occur. Those 
conditions are as follows: 

1. The ratio of the criterion for the best noise predictor to the best signal 
predictor is less than a. 

2. All of the predictors under the kernel are signal predictors. 

3. All of the signal predictors are under the kernel. 

The first ensures that in case (d) the correlation between the signal predictor 
is large enough to be selected even in this case. Because the weights Wj only 
depend on membership in A'^'^-', the second and third conditions ensure that 
case (d) never occurs: the second means there are only signal predictors 
in A^'^), and the third means that there are no signal predictors not in A^*^). 

These assumptions are fairly mild, especially if we have a strong belief 
that supp(/3) is truly structured. We propose that the first condition holds 
for early steps of CaSpaR. We can reasonably assume that it is possible for 
an oracle a to be such that the signal is sufficiently dominant over noise. The 
last two conditions should hold for later steps of the algorithm: enough points 
within each cluster have already been discovered so that it only remains to 
fill in the clusters. 

7. Conclusion. We introduced a new method, CaSpaR, that allows us to 
build sparse regression models where we have some additional information 
about the structure of the sparsity pattern. We presented an application as 
well as a simulation study that show the method performs differently than 
the most popular sparse regression techniques. We discussed the general 
concept of graph sparsity, and showed that, under high "signal-to-noise" 
conditions ||/3|| Vo" ~ 500, our method provides a flexible way to approximate 
graph sparsity. 

Our simulation study suggests that under structured sparsity conditions, 
CaSpaR can recover the true target with less data than standard techniques. 
This motivates future work to show that this property has a theoretical 
basis. Other topics of interest include adding backward steps to the CaSpaR 
algorithm as well as an extension to a convex minimization procedure, which 
may have some computational advantages over the stepwise procedure. 
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